Base Package

Dataset available at: https://raw.githubusercontent.com/unalyticsteam/bases-de-datos/master/auto-mpg.csv

# load the dataset
auto <- read.csv("../../data/tema2/auto-mpg.csv")
str(auto)

## 'data.frame':    398 obs. of  9 variables:
##  $ No          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mpg         : num  28 19 36 28 21 23 15.5 32.9 16 13 ...
##  $ cylinders   : int  4 3 4 4 6 4 8 4 6 8 ...
##  $ displacement: num  140 70 107 97 199 115 304 119 250 318 ...
##  $ horsepower  : int  90 97 75 92 90 95 120 100 105 150 ...
##  $ weight      : int  2264 2330 2205 2288 2648 2694 3962 2615 3897 3755 ...
##  $ acceleration: num  15.5 13.5 14.5 17 15 15 13.9 14.8 18.5 14 ...
##  $ model_year  : int  71 72 82 72 70 75 76 81 75 76 ...
##  $ car_name    : Factor w/ 305 levels "amc ambassador brougham",..: 66 184 165 86 8 18 11 79 42 112 ...

Note that the structure of the auto data frame shows the variable cylinders as an integer, but our knowledge of the dataset tells us it should be categorical, so we convert it as follows.

auto$cylinders <- factor(auto$cylinders)
str(auto)

## 'data.frame':    398 obs. of  9 variables:
##  $ No          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mpg         : num  28 19 36 28 21 23 15.5 32.9 16 13 ...
##  $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 2 1 2 2 4 2 5 2 4 5 ...
##  $ displacement: num  140 70 107 97 199 115 304 119 250 318 ...
##  $ horsepower  : int  90 97 75 92 90 95 120 100 105 150 ...
##  $ weight      : int  2264 2330 2205 2288 2648 2694 3962 2615 3897 3755 ...
##  $ acceleration: num  15.5 13.5 14.5 17 15 15 13.9 14.8 18.5 14 ...
##  $ model_year  : int  71 72 82 72 70 75 76 81 75 76 ...
##  $ car_name    : Factor w/ 305 levels "amc ambassador brougham",..: 66 184 165 86 8 18 11 79 42 112 ...

Now cylinders is a categorical variable (a factor) with 5 levels.
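The conversion can be checked with a few helper functions. The following is a minimal sketch that assumes the built-in mtcars data and its cyl column instead of the auto-mpg file, so it runs without any download; the name cars is only illustrative.

```r
# Assumption: mtcars stands in for the auto data frame used in the tutorial.
cars <- mtcars

# Convert the numeric cylinder count into a categorical variable.
cars$cyl <- factor(cars$cyl)

levels(cars$cyl)    # the distinct categories
nlevels(cars$cyl)   # how many categories there are
table(cars$cyl)     # number of observations in each category
```

table() is a quick way to spot levels with very few observations before plotting by group.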

head(auto, 5)

##   No mpg cylinders displacement horsepower weight acceleration model_year
## 1  1  28         4          140         90   2264         15.5         71
## 2  2  19         3           70         97   2330         13.5         72
## 3  3  36         4          107         75   2205         14.5         82
## 4  4  28         4           97         92   2288         17.0         72
## 5  5  21         6          199         90   2648         15.0         70
##              car_name
## 1 chevrolet vega 2300
## 2     mazda rx2 coupe
## 3        honda accord
## 4     datsun 510 (sw)
## 5         amc gremlin

Let us use the attach() function so we can access the variables of the data frame directly, that is, without having to index into it or use the dollar sign $.

attach(auto)
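An attached data frame can be masked by other objects with the same names, so some users prefer with() or the data argument. The following is a small sketch of that alternative, assuming the built-in mtcars data rather than the auto-mpg file; the variable names are illustrative.

```r
# Assumption: mtcars stands in for the auto data frame.
cars <- mtcars

# with() evaluates the expression using the columns of the data frame,
# so no attach() is needed and nothing is added to the search path.
mean_with <- with(cars, mean(mpg))

# The explicit $ form gives the same result.
mean_dollar <- mean(cars$mpg)

# Many plotting functions accept a data argument for the same purpose.
boxplot(mpg ~ cyl, data = cars)
```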

Histogram

To draw a frequency histogram, use the hist() function. Its main parameters (arguments) are:

x: vector of values.
breaks: number of bins in the histogram (R treats a single number as a suggestion)
col: fill color of the bars
xlab: x-axis label
ylab: y-axis label
main: main title of the plot

hist(acceleration, col = "red", xlab = "Acceleration", ylab = "Frequency", breaks = 16, main = "Histogram of acceleration")
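When exact bin edges matter, breaks can also be given as a vector of cut points. The sketch below assumes the qsec column of the built-in mtcars data as a stand-in for acceleration, and uses plot = FALSE to inspect the counts without drawing.

```r
# Assumption: mtcars$qsec stands in for the acceleration variable.
x <- mtcars$qsec

# Explicit bin edges: one-second bins from 14 to 23.
h <- hist(x, breaks = seq(14, 23, by = 1), plot = FALSE)

h$breaks   # the bin edges actually used
h$counts   # number of observations in each bin

# Draw the same histogram once the bins look right.
plot(h, col = "red", xlab = "Quarter-mile time (s)", main = "Histogram with explicit breaks")
```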

Density

To overlay a density curve on the histogram, the histogram must be drawn with relative frequencies (densities) rather than absolute counts. This is requested with the argument probability = TRUE (which can be abbreviated to prob = TRUE), and the density curve is added with the lines() function. To draw only the density, use plot(density(acceleration)).

hist(acceleration, probability = TRUE)
lines(density(acceleration))
# vertical line at the mean
abline(v=mean(acceleration), col="red", lwd=5)

Boxplot

To draw a box-and-whisker plot, use the boxplot() function.

boxplot(mpg, xlab = "Miles per gallon", main = "Box-and-whisker plot of mpg")

# add the mean (as a line and as a point)
points(mean(mpg), col="red", pch = 19)
abline(h=mean(mpg), col = "blue")

Creating boxplots of mpg (miles per gallon) grouped by model_year (model year)

boxplot(mpg ~ model_year, main = "Miles per gallon by year")

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# compute the mean for each year
mu <- auto %>% group_by(model_year) %>% summarise(medias_year = mean(mpg))

Let us plot the mean for each year.

boxplot(mpg ~ model_year, main = "Miles per gallon by year")
points(mu$medias_year, col = "red", pch = 19, type = "b")
legend("topright", legend = "mean", col = "red", pch = 19)

Note: The plot shows that both the mean and the median miles per gallon tend to increase with model year.
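The group means can also be computed without dplyr. The following is a base R sketch that assumes the built-in mtcars data grouped by cyl instead of auto-mpg grouped by model_year; it relies on boxplot() placing the boxes at x = 1, 2, 3, ... in the order of the factor levels.

```r
# Assumption: mtcars grouped by cyl stands in for auto grouped by model_year.
cars <- mtcars
cars$cyl <- factor(cars$cyl)

# tapply() returns one mean per factor level, in level order.
means <- tapply(cars$mpg, cars$cyl, mean)

boxplot(mpg ~ cyl, data = cars, main = "Miles per gallon by cylinders")

# Boxes sit at positions 1..k, so the means are drawn at the same positions.
points(seq_along(means), means, col = "red", pch = 19, type = "b")
legend("topright", legend = "mean", col = "red", pch = 19)
```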

Scatterplot

To draw a scatterplot, use the plot() function.

plot (mpg ~ horsepower)

Note: This plot shows mpg as a function of horsepower: the higher the horsepower, the fewer miles the car travels per gallon.

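Let us plot using colors. Passing the factor cylinders to col colors each point by the integer code of its level in the current palette.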

plot (mpg ~ horsepower, col=cylinders)

Let us draw an empty plot with plot() and then add the points with points(), which gives more control over the color of each category.

plot(mpg ~ horsepower, type ="n")
with(subset(auto, cylinders==3), points(mpg~horsepower, col="skyblue"))
with(subset(auto, cylinders==4), points(mpg~horsepower, col="blue"))
with(subset(auto, cylinders==5), points(mpg~horsepower, col="green"))
with(subset(auto, cylinders==6), points(mpg~horsepower, col="orange"))
with(subset(auto, cylinders==8), points(mpg~horsepower, col="red"))
# add a legend (pch = 1 matches the open circles drawn by points(); top right is the emptier corner)
legend("topright", legend = levels(cylinders), col=c("skyblue","blue","green","orange","red"), pch=1)
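The same per-category coloring can be written more compactly by indexing a named color vector with the factor. This is a sketch under the assumption that the built-in mtcars data (cyl and hp) stands in for auto; pal and point_cols are illustrative names.

```r
# Assumption: mtcars stands in for the auto data frame.
cars <- mtcars
cars$cyl <- factor(cars$cyl)

# One color per factor level, named by level.
pal <- c("skyblue", "orange", "red")
names(pal) <- levels(cars$cyl)

# Look up the color of every observation by its level.
point_cols <- pal[as.character(cars$cyl)]

plot(mpg ~ hp, data = cars, col = point_cols, pch = 19)
legend("topright", legend = names(pal), col = pal, pch = 19, title = "cyl")
```

Because the legend is built from the same vector as the points, colors and labels cannot drift apart when a level is added.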

Scatterplot Matrix

To draw a scatterplot matrix, use the pairs() function; calling plot() with a one-sided formula or with a data frame produces the same kind of plot.

plot(~mpg+displacement+horsepower+weight)

# plot(auto[, c("mpg","displacement","horsepower","weight")])
# pairs() can also be used

Note: This plot shows how each variable relates to each of the others (pairwise relationships, not a trend).
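The explicit pairs() call, together with a numeric summary of the same pairwise relationships, could look like the sketch below. It assumes the built-in mtcars columns mpg, disp, hp and wt as stand-ins for the auto-mpg variables.

```r
# Assumption: four mtcars columns stand in for mpg, displacement, horsepower, weight.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Scatterplot matrix: one panel per pair of variables.
pairs(vars, main = "Scatterplot matrix")

# Pearson correlations quantify what the panels show.
cor_mat <- cor(vars)
round(cor_mat, 2)
```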

Bar plot

tabla <- table(cylinders)
barplot(tabla)

The par() function

The par() function sets the graphical parameters used by subsequent plots. To see the current parameters, type par() in the console.

# Example: setting the title color and the background color
par(col.main ="red", bg="gray")
plot(mpg, main = "Miles per gallon")

The par() function can also set how many plots are shown in the plotting window

# Layout c(2,2): two rows and two columns = 4 plots
par(mfrow=c(2,2))
# plot 1
hist(mpg)
# plot 2
plot(density(mpg))
# plot 3
hist(mpg, probability = TRUE)
lines(density(mpg))
# plot 4
hist(mpg, probability = TRUE, col = "blue")
lines(density(mpg))
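Settings made with par() stay in effect for later plots on the same device, so it is common to save the old values and restore them afterwards. A minimal sketch, assuming the built-in mtcars data:

```r
# par() returns the previous values of the parameters it changes.
op <- par(mfrow = c(1, 2))

hist(mtcars$mpg, main = "Histogram")
plot(density(mtcars$mpg), main = "Density")

# Restore the previous layout so later plots use a single panel again.
par(op)
par("mfrow")
```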

Q-Q plot

A quantile-quantile plot lets us check whether a dataset comes from a given theoretical distribution. Let us use the qqnorm() function to check (visually) whether a variable is normally distributed.

qqnorm(mpg)
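A reference line makes departures from normality easier to see. The sketch below adds one with qqline(), assuming the mpg column of the built-in mtcars data.

```r
# Assumption: mtcars$mpg stands in for the mpg variable of auto.
x <- mtcars$mpg

# qqnorm() draws the plot and invisibly returns the plotted coordinates.
qq <- qqnorm(x, main = "Normal Q-Q plot of mpg")

# qqline() adds a line through the first and third quartiles.
qqline(x, col = "red")
```

Points that follow the line closely suggest an approximately normal distribution; systematic curvature at the ends suggests skewness or heavy tails.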

lattice

Specialized, elegant graphics aimed at multivariate data.

Loading the library

# to install the package
# install.packages("lattice")
library(lattice)

## Warning: package 'lattice' was built under R version 3.5.3

Let us use the same data frame

auto <- read.csv("../../data/tema2/auto-mpg.csv")
auto$cylinders <- factor(auto$cylinders)

bwplot(~auto$mpg | auto$cylinders, main = "Mpg by cylinders")

#require(dplyr)
bwplot(~mpg | cylinders, data=auto)

mu <- auto %>% group_by(cylinders) %>% summarise(medias = mean(mpg))

xyplot(mpg~weight | cylinders, data = auto, main = "Mpg vs weight by cylinders")

# densityplot() takes a one-sided formula: the density of mpg within each cylinders panel
densityplot(~mpg | cylinders, data = auto)

ggplot2

ggplot2 is an R package created by Hadley Wickham for building high-level graphics without having to worry about the details (legends, fonts, colors). The package is based on the grammar of graphics (the "gg" in its name), a concept described by Leland Wilkinson in The Grammar of Graphics (second edition, 2005).

Official documentation: http://docs.ggplot2.org/current/

In the ggplot2 approach, a plot is the sum of 3 fundamental parts

Plot = Data + Aesthetics + Geometry

Data: the data frame.

Aesthetics (aes): specifies the x and y variables and controls the color, size, and shape of the points…

Geometry (geom): the type of plot (histogram, box plot, line plot, density plot, dot plot, …)
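The three parts map directly onto code. The following sketch assumes ggplot2 is installed and uses the built-in mtcars data instead of auto-mpg; each commented line corresponds to one part of the sum.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                                    # Data
            mapping = aes(x = wt, y = mpg,
                          colour = factor(cyl))) +            # Aesthetics
  geom_point()                                                # Geometry

print(p)
```

Swapping geom_point() for another geom changes the plot type while the data and the aesthetic mapping stay the same.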

What we will need:

# ggplot2: for data visualization.
install.packages("ggplot2")
# dplyr: for data manipulation.
install.packages("dplyr")

Loading the libraries into the RStudio session.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'auto':
## 
##     mpg

library(dplyr)

The two main ggplot2 functions for creating plots.

qplot(): a quick plotting function that is easy to use for simple plots.

ggplot(): a more flexible function that lets us build a plot piece by piece

The qplot() function

A quick plotting function that can be used to create different kinds of plots quickly and easily: scatterplots, violin plots, histograms, density plots. Note that recent ggplot2 releases deprecate qplot() in favor of ggplot().

The basic form of the qplot() function is as follows:

qplot(x, y = NULL, data, geom="auto")

Other arguments of the qplot() function

main, xlab, ylab

Let us use the auto-mpg dataset again, this time reading it directly from its URL

auto <- read.csv("https://raw.githubusercontent.com/unalyticsteam/bases-de-datos/master/auto-mpg.csv")
auto$cylinders <- factor(auto$cylinders)

Scatter plots

Dot plot or scatterplot, used to show the relationship between two quantitative variables.

qplot(x=mpg, y=weight, data=auto, geom="point", main="Weight vs mpg",
      xlab = "Miles per gallon",
      ylab = "Weight (lb)")

The qplot() function is very similar to the plot() function of base R

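qplot(x=mpg, y =weight, data = auto, geom = c("point", "smooth"), method ="lm")

## Warning: Ignoring unknown parameters: method

The warning most likely appears because qplot() forwards method to every geom and geom_point() does not use it; the smoothing layer still receives it.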

Changing the color according to the values of a continuous variable.

qplot(x=mpg, y=weight, data = auto, color = acceleration)

Changing color and shape according to the values of a categorical variable (a factor)

qplot(x=mpg, y=weight, data= auto, col = cylinders, shape = cylinders)

Colors can be controlled by continuous, discrete, or categorical variables.

Box plot

qplot(x= cylinders , y= mpg, data=auto, geom ="boxplot")

qplot(x= cylinders , y= mpg, data=auto, geom ="boxplot") + stat_summary(fun.y = mean, geom="point", color = "red")

Violin plots

Violin plots are similar to box-and-whisker plots, except that they also show the estimated density of the data.

ggplot(data=auto, aes(x=cylinders, y=mpg)) + geom_violin() +
geom_boxplot(width = 0.2) + stat_summary(fun.y = mean, geom="point", color = "red")

Histogram

qplot(x= mpg, data=auto, geom ="histogram", bins = 20) +
  geom_vline(aes(xintercept=mean(mpg)), color ="red", size=2)

Density plots

qplot(x= mpg, data=auto, geom ="density") + geom_vline(aes(xintercept=mean(mpg)), color = "red")

The ggplot() function

ggplot() is more powerful and flexible than qplot() and lets us build plots piece by piece

Scatter plot

A scatterplot.

ggplot(data = auto, aes(x=weight, y=mpg)) + geom_point()

Now let us change the shape of the points

ggplot(data = auto, aes(x=weight, y=mpg)) + geom_point(size = 4, shape=18)

Functions such as geom_point() and geom_line() add layers. Several layers can be combined, and each layer can have its own aesthetics such as color, size, etc.

ggplot(data=auto, aes(x=weight, y=mpg)) + geom_point() + geom_line(color="red")

Using different data for different layers.

A subset of the same data is used for the line layer.

ggplot(data=auto, aes(x=weight, y=mpg)) + geom_point() +
  geom_line(data=head(auto,10), aes(x=weight, y=mpg),color="red")

Scatter plot with regression line.

ggplot(data = auto, aes(x = weight, y =mpg)) + geom_point(alpha = 1/2, size=5, aes(color = cylinders)) +
  geom_smooth(method = "lm", col="blue") + geom_rug(aes(color=cylinders))

ggplot(data = auto, aes(x = weight, y =mpg)) + 
geom_point(aes(color = cylinders, shape = cylinders)) +
geom_smooth(aes(color = cylinders), method = lm, se = FALSE, fullrange = TRUE)

ggplot(data = auto, aes(x = weight, y =mpg)) + geom_point(alpha = 1/2, size=5, aes(color = cylinders)) + geom_smooth(method = "lm", col="blue") + facet_grid(cylinders~.) +
  labs(x="Weight") + labs(y="Miles per gallon") + labs(title="mpg vs weight")

ggplot(data = auto, aes(x = weight, y =mpg)) + geom_point(alpha = 1/2, aes(size = horsepower, color = horsepower)) +
  geom_smooth(method = lm, col="blue")

Regression or smoothing

geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)

method: smoothing method to use. Possible values include lm, glm, gam, loess, rlm.

formula: formula used by the smoothing method, for example formula = y ~ poly(x, 3)

level: confidence level to use. The default is 0.95
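These arguments can be combined in a single call. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data; it fits a quadratic polynomial with lm and draws a 90% confidence band. The degree and the level are arbitrary choices for illustration.

```r
library(ggplot2)

p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # method picks the fitting function, formula its model,
  # se toggles the confidence band and level sets its width.
  geom_smooth(method = "lm", formula = y ~ poly(x, 2),
              se = TRUE, level = 0.90)

print(p_smooth)
```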

Densidad

ggplot(data=auto, aes(x=mpg)) + geom_density() + geom_vline(aes(xintercept = mean(mpg)), color ="red", size = 2, linetype ="dashed")

This can also be done with:

ggplot(data=auto, aes(x=mpg)) + stat_density()

Changing the color by group.

library(dplyr)
mu <- auto %>% group_by(cylinders) %>%  summarise(medias=mean(mpg))

ggplot(data=auto, aes(x=mpg)) + geom_density(aes(color=cylinders, fill=cylinders)) + geom_vline(data=mu, aes(xintercept=medias, color =cylinders))

Changing the colors manually

scale_color_manual()

scale_fill_manual()

scale_color_brewer()

scale_fill_brewer()

scale_color_grey()

scale_fill_grey()

ggplot(data=auto, aes(x=mpg)) + geom_density(aes(color=cylinders)) + geom_vline(data=mu, aes(xintercept=medias, color =cylinders)) + scale_color_grey()

ggplot(data=auto, aes(x=mpg)) + geom_density(aes(color=cylinders)) + geom_vline(data=mu, aes(xintercept=medias, color =cylinders)) +
  scale_color_manual(values = c("#c1b741",  "#4286f4", "#999999", "#E69F00", "#ce1c2e"))

Histogram.

ggplot(data= auto, aes(x=acceleration)) + geom_histogram(bins = 30, color = "gray") +
  geom_vline(aes(xintercept=mean(acceleration)), color = "red") + geom_vline(aes(xintercept=median(acceleration)), color = "blue")

Histogram and Density Plots

ggplot(data= auto, aes(x=acceleration)) + geom_histogram(aes(y=..density..)) +
  geom_density(color = "red")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Line plot

Used to visualize relationships between continuous variables

We will use the mtcars dataset that ships with R

ggplot(data = mtcars, aes(x=wt, y=mpg)) + geom_line(linetype ="dashed", color = "red")

ggplot(data = mtcars, aes(x=wt, y=mpg)) + geom_line(aes(color = as.factor(carb))) +
  labs(color = "Carburetors")

Q-Q plot.

Quantile-quantile plots are used to check whether a given dataset follows the normal distribution.

ggplot(data= auto, aes(sample=mpg)) + stat_qq(data=auto,aes(color=cylinders))

Bar plot

To count the frequencies of a categorical variable

ggplot(data=auto, aes(x=cylinders)) + geom_bar(color = "blue", fill="white")

# creating another categorical variable
auto$weigth2 <- ifelse(auto$weight > 2804, "heavy", "light")
auto$weigth2 <- factor(auto$weigth2)
# group by the column name directly so the grouping column is called weigth2
suma <- auto %>% group_by(cylinders, weigth2) %>% summarise(suma2= sum(mpg))


ggplot(data=auto, aes(x= cylinders, y = mpg, fill=weigth2)) + geom_bar(stat = "identity")

Text annotations

ggplot(data=auto[1:20,], aes(x=mpg, y=weight)) + geom_text(aes(label = car_name),
size = 3)

## ggrepel
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 3.5.3

ggplot(data=auto[1:5,], aes(x=mpg, y=weight)) + 
  geom_point() + geom_label_repel(aes(label=car_name))

Plot types

Let us work with the diamonds dataset, which is included with ggplot2

ggplot(data= diamonds, aes(x=carat, y=price)) + geom_hex(bins=50)

## Warning: package 'hexbin' was built under R version 3.5.3

Pie Charts

coord_polar()

df <- data.frame(
group = c("Male", "Female", "Child"),
value = c(25, 25, 50))
head(df)

##    group value
## 1   Male    25
## 2 Female    25
## 3  Child    50

ggplot(df, aes(x="", y = value, fill=group)) +
geom_bar(width = 1, stat = "identity") + coord_polar("y", start=0)

Titles and axis labels

The functions below can be used to change titles and labels:

p + ggtitle("Main title"): Adds a main title above the plot

p + xlab("X axis label"): Changes the X axis label

p + ylab("Y axis label"): Changes the Y axis label

p + labs(title = "Main title", x = "X axis label", y = "Y axis label"): Sets the title and both axis labels in one call
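In these expressions p is any existing ggplot object. A short sketch, assuming ggplot2 is installed and using the built-in mtcars data; the label texts are illustrative.

```r
library(ggplot2)

# p is the plot to be labelled.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# One labs() call sets the title and both axis labels.
p_labelled <- p + labs(title = "Fuel economy vs weight",
                       x = "Weight (1000 lb)",
                       y = "Miles per gallon")

print(p_labelled)
```

Adding labels returns a new plot object, so p itself is left unchanged and can be reused with different labels.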

ggplotly()

library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 3.5.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

gapminder <- read.csv("https://raw.githubusercontent.com/unalyticsteam/bases-de-datos/master/gapminder.csv")

p <- ggplot(data = gapminder, aes(y=lifeExp, x= gdpPercap, colour = continent)) + geom_point(aes(size=pop, frame=year, ids=country), alpha = 0.5) + scale_x_log10()

## Warning: Ignoring unknown aesthetics: frame, ids

print(p)

ggplotly(p)

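Arranging multiple plots on the same page

library("gridExtra")
library("cowplot")

# three example plots to arrange (any ggplot objects work)
grafico1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
grafico2 <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)
grafico3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
plot_grid(grafico1, grafico2, grafico3, labels = c("A", "B", "C"),
ncol = 2, nrow = 2)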

Correlation Matrix

Saving plots.

# save plots as pdf
pdf("myplot.pdf")
myplot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
print(myplot)
dev.off()

# save plots as png
png("myplot.png")
print(myplot)
dev.off()

# 1. Create a plot: displayed on the screen (by default)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
# 2.1. Save the plot to a pdf
ggsave("myplot.pdf")
# 2.2 OR save it to png file
ggsave("myplot.png")
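By default ggsave() saves the last plot that was displayed. Passing the plot and the size explicitly avoids depending on that state. The sketch below assumes ggplot2 is installed and writes to a temporary directory, which is an illustrative choice.

```r
library(ggplot2)

myplot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Assumption: a temporary directory is used so the example leaves no files behind.
out_file <- file.path(tempdir(), "myplot.pdf")

# width and height are in inches by default; the format follows the file extension.
ggsave(out_file, plot = myplot, width = 6, height = 4)

file.exists(out_file)
```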

Plotting with R

Heber Esteban Bermúdez González

May 11, 2019